To  appear  at  SPIE  Conf.  on  Three-Dimensional  Imaging,  Optical  Metrology,  and  Inspection  V,  (SPIE  3835),  Boston  MA,  Sep.,  1999. 


Recovery of Piece-Wise Planar and Piece-Wise Rigid Models

from  Non-Rigid  Motion 

Jonathan Alon and Stan Sclaroff

Image  and  Video  Computing  Group 
Computer  Science  Department 
Boston  University 
Boston,  MA  02215,  USA. 

ABSTRACT 

We  present  a  framework  for  estimating  3D  relative  structure  (shape)  and  motion  given  objects  undergoing  non-rigid 
deformation  as  observed  from  a  fixed  camera,  under  perspective  projection.  Deforming  surfaces  are  approximated  as 
piece-wise  planar,  and  piece-wise  rigid.  Robust  registration  methods  allow  tracking  of  corresponding  image  patches 
from  view  to  view  and  recovery  of  3D  shape  despite  occlusions,  discontinuities,  and  varying  illumination  conditions. 
Many  relatively  small  planar/rigid  image  patch  trackers  are  scattered  throughout  the  image;  resulting  estimates 
of structure and motion at each patch are combined over local neighborhoods via an oriented particle systems
formulation.  Preliminary  experiments  have  been  conducted  on  real  image  sequences  of  deforming  objects  and  on 
synthetic  sequences  where  ground  truth  is  known. 

1.  INTRODUCTION

Estimation  of  3D  structure  (shape)  and  motion  from  2D  image  sequences  has  been  a  central  problem  in  computer 
vision  for  many  years.  Many  early  studies  focused  on  methods  of  relating  pixel  coordinates  to  3D  coordinates  via 
camera calibration, that is, computing the projection matrix which relates image coordinates to a world coordinate
frame. In recent years, the focus has shifted to non-metric reconstruction from uncalibrated cameras, by computing
the fundamental matrix (two views) and the trilinear tensor (three views). Also, different camera models were
assumed; i.e., orthographic, perspective projection, or a unified model.

Determining  the  geometric  relationship  between  various  views  of  the  environment  and  its  3D  structure  is  a  key 
component  in  a  myriad  of  practical  applications:  reverse  engineering,  virtual  reality,  visualization,  surgical  planning, 
movie  special  effects,  computer  aided  design,  non-tactile  inspection,  manufacturing,  image  compression,  etc.  When 
3D  shape  and  motion  estimates  are  computed  in  real  time,  they  can  be  used  to  support  applications  where  a  computer 
(or  robot)  must  interact  with  its  environment:  manipulation,  navigation  and  control,  tracking,  etc.  Furthermore, 
such  estimates  can  be  utilized  to  determine  the  locations,  postures,  and  configurations  of  humans  in  order  to  enable 
a  computer  to  assist  (or  avoid  hampering)  in  a  task. 

Despite  the  many  exciting  applications  and  the  energetic  progress  of  research  in  structure  and  motion  recovery 
algorithms,  many  problems  remain  unsolved.  Some  of  these  issues  are  related  to  numerical  stability  and/or  ambiguity 
of  the  solution  under  general  conditions. Other  problems  stem  from  the  rich  variety  of  shapes  and  motions 
that  are  possible  in  the  world.  In  particular,  many  shapes  can  be  non-planar  and/or  their  motion  can  be  non-rigid. 
Unfortunately,  all  of  the  above-mentioned  approaches  assume  that  object  points  in  3D  space  must  remain  at  fixed 
distances  from  each  other  during  motion. 

Our  goal  is  to  extend  these  approaches  to  non-rigid  objects.  We  propose  a  method  for  recovering  3D  shape  and 
motion  estimates  for  objects  undergoing  non-rigid  deformation  as  observed  from  a  fixed  camera,  under  perspective 
projection.*  A  natural  first  step  to  take  towards  solving  this  problem  is  to  assume  that  the  deforming  object  consists 
of  small  patches  that  are  rigid  and  planar  when  considering  small  enough  regions.  In  other  words,  we  will  employ  a 
representation where deforming surfaces will be approximated as piece-wise planar, and piece-wise rigid.

A  second  assumption  common  to  many  of  these  approaches  is  that  correspondence  between  features  in  different 
views is given. As will be outlined later, we utilize a tracker that automatically registers moving image patches from
frame to frame. Each corresponding warped image patch is then used directly in estimating the 3D orientation of
the piece-wise planar surface patch, and its 3D position up to a scale factor. A robust image registration formulation
provides stability to shadows, highlights, and partial occlusions. Furthermore, changes in illumination are modeled
explicitly.

*It is assumed that self-calibration of the camera is given or obtained via a standard technique.

Two different approaches for acquiring piece-wise rigid/planar models are possible: top-down and bottom-up.
In the top-down method, the initial hypothesis could be that an object’s motion can be adequately modeled as a
single moving rigid/planar patch; the model would then be subdivided and augmented as needed to account for
non-planar/non-rigid motion via an adaptive triangulation procedure. In the second, bottom-up approach, many
relatively small planar/rigid image patch trackers could be scattered throughout the image; resulting estimates of
structure and motion at each patch would then be combined over local neighborhoods via an extension of Szeliski’s
oriented particle systems formulation.

In our preliminary system, we have developed the bottom-up approach, and will report these results. The bottom-up
framework is evaluated using synthetic data in which ground truth, deformation, and noise levels are known. The
method’s efficacy is also demonstrated on real image sequences of deforming objects. Implementation of the top-down
approach, and experimental comparison of both strategies, is left as future work.

2.  BACKGROUND 

The many years of work in structure from motion have led to significant advances in recovery of detailed, texture
mapped models and motion estimates from video to support graphics, visualization, and compression. A number
of researchers have demonstrated systems that can recover planar models and texture maps from image streams.
Other researchers have demonstrated methods for recovering polygonal models of an
object that is positioned on a rotating platform.

Other approaches focus on the problem of structure from tracked feature points (or lines) with known correspondence
from two or more frames, under orthographic or perspective projection. If desired, a polygonal model can
be recovered from the resulting collection of unorganized 3D point position estimates via triangulation or via
surface approximation.

In point based methods, feature tracking and correspondence are assumed. Such tracking can be attained via any
number of techniques. Typically, image correlation or sum of squared differences methods are used. A point
feature is essentially a small image patch, which is tracked by optimizing some matching criterion with respect to
translation or affine image deformation. Selection of good points to track can be based on a number of factors,
including corners, texture, and sufficient zero crossings in the Laplacian of image intensity. Unfortunately, even
a “good” feature can be difficult to track if it lies on a depth discontinuity, or across the boundary of a specular
highlight, or if it is occluded during tracking. Such problems call for the use of smaller feature windows, since smaller
windows tend to be less likely to straddle discontinuities. However, there is a tradeoff: estimates based on smaller
windows tend to be more susceptible to noise and outliers, since there are fewer pixels per feature window tracked.

Another set of methods is based on image registration. Take, for example, the plane plus parallax methods [18, 33-35].
These methods exploit a dominant planar motion to compute the epipoles and perform a projective
reconstruction. Such methods can use robust minimization methods to overcome the influence of outliers.

All  of  the  methods  mentioned  so  far  assume  rigid  motion  in  order  to  recover  a  model.  This  limits  the  utility 
of  the  above  methods  to  recovery  of  rigid  structure  and  motion  estimates.  In  images,  the  deformational  motion  of 
objects is sometimes due to changes in viewing geometry. In many such cases, the above-mentioned methods are
sufficient.  However,  in  general,  these  parameterizations  are  inadequate  for  representing  motions  that  arise  due  to  a 
general  non-rigid  deformation.  For  instance,  most  biological  objects  are  flexible  and  articulated:  fingers  bend,  cheeks 
bulge,  fish  swim,  trees  sway  in  the  breeze,  etc.  Shapes  are  stretched,  bent,  tapered,  dented,  etc.,  and  so  it  seems 
logical  to  employ  a  model  that  can  encode  the  ways  in  which  real  objects  deform. 

This rationale led to the development of 3D active shape models. These models utilize a predefined structure
that incorporates prior knowledge about a shape’s smoothness and its resistance to deformation. A number of
different 3D deformable model formulations have been proposed; e.g., deformable tubes, ellipsoidal models,
superquadrics, etc. Perhaps the major limitation of such methods is the requirement that every object be
described as the deformations of a single prototype object. This limits the kinds of shapes (and topologies) that can
be recovered in general, since we can only recover shapes that are achievable via the specific geometric model and
non-rigid motion formulation.

Figure 1. Construction of an example image patch model via active blobs. From left to right: (a) input image with region
of interest overlaid, (b) triangle mesh model, (c) texture mapped model.

Some researchers attempt to overcome this limitation through the use of more general 3D deformable part
decompositions, local deformations, shape evolution models, or adaptive subdivision. These
methods offer greater generality, but are still somewhat limited in the shapes and deformations they can describe.
Furthermore, these techniques sometimes require careful initial placement of the model, reliable feature
detection for model-image correspondence, or the delicate choice of model parameters (e.g., stiffness).

A  second  assumption  common  to  many  of  the  above  approaches  is  that  the  correspondence  between  features 
in  the  different  views  is  known.  To  get  around  this  problem,  we  will  use  a  tracker  that  automatically  determines 
correspondence  via  registration  of  image  patches  from  frame  to  frame,  as  described  in  the  next  section. 

3.  TRACKING  DEFORMING  IMAGE  PATCHES 

A key component of the proposed approach is tracking visible parts of objects from frame to frame. A promising family
of approaches is based on tracking of deforming image regions. These approaches integrate information
over an image patch, and therefore tend to be more immune to noise and/or low contrast, especially if a robust
estimator formulation is employed. Typically, use of a robust approach requires batch processing, though multiscale
techniques offer some hope for real-time performance. Real-time approaches for tracking of parameterized patches
have been developed; however, they do not address general non-rigid motion tracking.

3.1.  Active  Blobs  Formulation 

More general non-rigid motion tracking can be accomplished via the active blobs formulation. The formulation
provides robustness to occlusions, wrinkles, shadows, and specular highlights. Furthermore, it is tailored to take
advantage of texture mapping hardware available in many workstations, PCs, and game consoles. This enables
non-rigid tracking at speeds approaching video rate.

In  the  active  blobs  formulation,  shape  of  the  image  patch  is  modeled  with  a  deformable  triangular  mesh.  The 
construction  of  an  example  active  blob  model  is  shown  in  Fig.  1.  Fig.  1(a)  shows  the  first  image  in  a  sequence  with 
regions  of  interest  outlined.  A  2D  active  triangular  image  patch  model  is  then  constructed  for  the  region  of  interest 
as  shown  in  Fig.  1(b).  The  blob’s  appearance  is  then  captured  as  a  color  texture  map  and  applied  directly  to  the 
triangulated  model  as  shown  in  Fig.  1(c). 

For  tracking,  the  active  blob  model  is  warped  such  that  it  is  registered  with  the  incoming  image  sequence.  Warping 
is  defined  as  a  deformation  of  the  triangular  mesh  and  then  a  bilinear  resampling  of  the  texture  mapped  triangles. 
In  essence,  texture  mapping  is  used  to  define  a  warping  function  for  the  input  image,  I: 

I'  =  cW(I,u)+b  =  W(I,a),  (1) 

where  u  is  a  vector  containing  deformation  parameters,  and  b  and  c  model  brightness  and  contrast  variations.  For 
notational  convenience,  we  concatenate  the  parameters  u,  b,  c  together  in  a  generic  parameter  vector  a,  and  define  a 
generic  warping  function  W.  In  our  current  system,  the  photometric  correction  terms  are  defined  as  bilinear  functions 
that  scale  the  red,  green,  and  blue  channels  equally. 


Perhaps the simplest deformation functions to be used in Eq. 1 are those of an eight parameter projective
model. Such functions are suitable for approximating the rigid motion of a planar patch. However, since the
piece-wise planar/rigid assumption is likely to be violated, we utilize a parameterization that can accommodate greater
variability.

A more general parameterization of non-rigid motion can be obtained via the modal representation, where
deformation is represented in terms of eigenvectors of a finite element (FE) model. The underlying FE formulation
offers the added advantage that it can be used in obtaining a regularized solution to the non-rigid tracking problem.
For a given modal parameter vector obtained in tracking, we can compute the strain energy associated with
deformation:

$E_{strain} = \frac{1}{2} \sum_{j=1}^{m} \omega_j u_j^2$,  (2)

where $\omega_j$ is the stiffness associated with the j-th modal deformation parameter $u_j$. Note that these stiffnesses are
determined directly from the FE shape model.

Recall that in Eq. 1, we concatenate the deformation and lighting parameters u, b, c together in a generic
parameter vector a. Therefore, generalized stiffnesses are needed. We define a diagonal, generalized stiffness matrix $\Omega$
that contains the modal stiffnesses $\omega_j$ and stiffnesses for the lighting parameters along the diagonal. The lighting
stiffnesses are inversely proportional to the expected variance in lighting, and estimated via statistical methods.

Tracking is then posed as a problem of regularized active blob registration. For each frame, the image template
is warped to minimize a regularized registration function:

$E = \frac{1}{n} \sum_{i=1}^{n} \rho(e_i, \sigma) + \gamma\, a^T \Omega\, a$  (3)

$e_i = \| I'(x_i, y_i) - I(x_i, y_i) \|$  (4)

where $I'(x_i, y_i)$ is a pixel in the warped template (Eq. 1), $I(x_i, y_i)$ is the pixel at the same location in the input,
n is the number of pixels in the patch, $\sigma$ and $\gamma$ are scale parameters, and $\rho$ is an influence function.

The influence function $\rho$ is also known as a robust error norm. It is equivalent to the incorporation of an analog
outlier process in our objective function. This results in better robustness to specular highlights and occlusions. In
our experiments, we have used the function $\rho(e_i, \sigma) = \log(1 + e_i^2 / (2\sigma^2))$. For efficiency, the log function can be
implemented via table look-up.
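
As a rough illustration of the table look-up idea, the sketch below precomputes the robust error norm for integer residuals. It assumes residual magnitudes are quantized to the range 0..255 (as they would be for 8-bit intensities); the function names and the value of sigma are hypothetical and chosen only for the example.

```python
import math

def rho(e, sigma):
    """Robust error norm rho(e, sigma) = log(1 + e^2 / (2 sigma^2))."""
    return math.log(1.0 + (e * e) / (2.0 * sigma * sigma))

def build_rho_table(sigma, max_error=255):
    """Precompute rho for every integer residual 0..max_error (assumed 8-bit range)."""
    return [rho(e, sigma) for e in range(max_error + 1)]

def mean_robust_error(residuals, table):
    """Average robust error over integer residuals, using the table instead of log()."""
    clipped = [min(abs(int(r)), len(table) - 1) for r in residuals]
    return sum(table[r] for r in clipped) / len(clipped)

sigma = 10.0                      # hypothetical scale parameter
table = build_rho_table(sigma)
inliers = [2, -3, 1, 0, 4]
with_outlier = inliers + [200]    # one gross outlier, e.g. a specular highlight
print(mean_robust_error(inliers, table))
print(mean_robust_error(with_outlier, table))
```

The outlier contributes only logarithmically to the total, whereas a squared-error norm would let it dominate the sum.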

3.2.  Robust  Registration  Algorithm 

Registration requires minimization of residual error (Eq. 3) with respect to the deformation and lighting parameters.
A common approach to multi-dimensional minimization problems is the Marquardt-Levenberg method. Marquardt-Levenberg
requires the calculation of O(N) gradient images and O(N^2) image products per iteration of minimization,
where N is the number of model parameters. To decrease the number of gradient calculations needed, we can use a
difference decomposition. The approach only requires the equivalent of O(1) image gradient calculations and
O(N) image products per iteration.

In  the  difference  decomposition,  a  set  of  difference  images  is  generated  by  adding  small  changes  to  each  of  the 
blob  parameters.  Each  difference  image  takes  the  form: 

$b_k = I_0 - W(I_0, n_k)$,  (5)

where $I_0$ is the template image, and $n_k$ is the parameter displacement vector for the difference image $b_k$. Each
difference image becomes a column in the matrix B. The difference matrix can be precomputed; this is the key to
the difference decomposition’s speed.

During  tracking,  an  incoming  image  I  is  inverse  warped  into  the  blob’s  coordinate  system  using  the  most  recent 
estimate  of  the  warping  parameters  a.  The  difference  between  the  inverse-warped  image  and  template  is  then 
computed: 

$D = I_0 - W^{-1}(I, a)$.  (6)


Figure  2.  Tracking  of  a  patch,  over  a  number  of  frames  in  a  video  sequence.  The  patch  outline  is  shown  in  white.  The 
registration  of  the  image  patch  from  frame  to  frame  implicitly  establishes  correspondence,  allowing  us  to  compute  a  least 
squares estimate of the local surface orientation and relative depth. The recovered surface normal is shown overlaid on
the input sequence.

This difference image D can be approximated in terms of a linear combination of the difference decomposition’s
vectors: $D \approx Bq$, where q is a vector of coefficients. Thus, the maximum likelihood estimate of q can be obtained
via least squares:

$q = (B^T B)^{-1} B^T D$.  (7)

The change in the image warping parameters is obtained via matrix multiplication:

$\Delta a = N q$,  (8)

where N has columns formed by the parameter displacement vectors $n_k$ used in generating the difference basis.
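
The steps of Eqs. 5 through 8 can be sketched on a toy problem. The example below is only an illustration under strong assumptions: the "image" is a 1D signal, and the hypothetical warp has two parameters (a sub-sample shift and a brightness offset) rather than the modal and photometric parameters of the actual system. The signal, step sizes, and parameter values are made up.

```python
import numpy as np

def warp(signal, a):
    """Toy warp W(I, a): shift by a[0] samples (linear interpolation), add brightness a[1]."""
    x = np.arange(signal.size, dtype=float)
    return np.interp(x - a[0], x, signal) + a[1]

def difference_basis(template, displacements):
    """Columns b_k = I_0 - W(I_0, n_k), one per parameter displacement n_k (Eq. 5)."""
    return np.column_stack([template - warp(template, n_k) for n_k in displacements])

x = np.arange(128, dtype=float)
template = np.sin(2 * np.pi * x / 32) + 0.5 * np.cos(2 * np.pi * x / 20)   # I_0

# Parameter displacements n_k: a small shift and a small brightness change.
displacements = [np.array([0.5, 0.0]), np.array([0.0, 0.1])]
N = np.column_stack(displacements)            # columns are the n_k
B = difference_basis(template, displacements)  # precomputed once per template

a_true = np.array([0.4, 0.05])
incoming = warp(template, a_true)             # incoming image I

# With the current estimate a = 0 the inverse warp is the identity, so D = I_0 - I (Eq. 6).
D = template - incoming
q, *_ = np.linalg.lstsq(B, D, rcond=None)     # least-squares fit of D ~ B q (Eq. 7)
delta_a = N @ q                               # parameter update (Eq. 8)
print(delta_a, a_true)
```

Because B depends only on the template, it is built once; each new frame then costs one difference image and one small least-squares solve.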

A robust solution can be obtained through inclusion of a diagonal weighting matrix in Eq. 7:

$q = (B^T S^{-1} B)^{-1} B^T S^{-1} D$,  (9)

where entries in the diagonal matrix S take the form $s_{ii} = 2\sigma^2 + D_i^2$ (here i is the pixel index), as derived from the
robust error norm $\rho$.
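
To make the effect of the weighting in Eq. 9 concrete, the following sketch compares the plain and weighted solutions on synthetic data, reading the diagonal entries of S as 2 sigma^2 + D_i^2. The basis matrix here is a hypothetical stand-in built from smooth functions, not a real difference decomposition, and the corrupted pixels and sigma are invented for the demonstration.

```python
import numpy as np

def plain_coefficients(B, D):
    """Ordinary least squares, q = (B^T B)^-1 B^T D (Eq. 7)."""
    return np.linalg.solve(B.T @ B, B.T @ D)

def robust_coefficients(B, D, sigma):
    """Weighted least squares of Eq. 9, with s_ii = 2 sigma^2 + D_i^2."""
    inv_s = 1.0 / (2.0 * sigma ** 2 + D ** 2)   # diagonal of S^-1
    BtW = B.T * inv_s                            # B^T S^-1 without forming S
    return np.linalg.solve(BtW @ B, BtW @ D)

x = np.linspace(0.0, 2.0 * np.pi, 100)
B = np.column_stack([np.sin(x), np.cos(x), np.ones_like(x)])   # stand-in basis
q_true = np.array([0.5, -0.3, 0.2])
D = B @ q_true
D[10:15] += 8.0            # a few grossly corrupted pixels (e.g. an occlusion)

plain_err = np.linalg.norm(plain_coefficients(B, D) - q_true)
robust_err = np.linalg.norm(robust_coefficients(B, D, sigma=0.5) - q_true)
print(plain_err, robust_err)
```

Pixels with large differences receive small weights, so the corrupted pixels pull the weighted estimate far less than the unweighted one.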

Finally, the formulation can be extended to include a regularizing term that enforces the priors on the model
parameters. This is accomplished using a constrained least squares formulation:

$q = P D - Q a$,  (10)

where the matrices P and Q are formed from B, $S^{-1}$, the scale parameter $\gamma$, and the generalized stiffness matrix.
If needed, this minimization procedure can be iterated at each frame until the percentage change in the error residual
is below a threshold, or the number of iterations exceeds some maximum.

An example of tracking an image patch via difference decomposition is shown in Fig. 2. Image warping and
registration implicitly establish correspondences between views; every pixel within an image patch now has a
corresponding location in the next frame. Given these corresponding pixel locations, we can recover estimates of
local planar structure and surface normal via least squares, as described in the next section.

4.  PIECE-WISE PLANAR STRUCTURE RECOVERY

For a given collection of corresponding image points in two views, we estimate the planar patch’s relative position
and orientation via an algorithm proposed by Weng et al. and similarly presented by Faugeras. The approach
employs a linear algorithm that yields a closed form solution. The formulation is briefly restated here. We consider
this as a preliminary formulation, since it is standard in the literature; however, we plan to evaluate other methods
for planar structure recovery in future work. In particular, multiple frame approaches, constrained approaches,
and more stable approaches seem promising.


Weng et al. use an ideal pinhole camera model with unit focal length. A conventional camera can be calibrated
so that every point in the actual image plane can be transformed to a point in the image plane of this normalized
model. Consider a point on the object that is visible at two time instants. The 3D spatial position of the point in
the first instant is denoted $\mathbf{x} = (x, y, z)^T$ and in the second $\mathbf{x}' = (x', y', z')^T$. The image vectors of the point
in the first and second images are denoted $X = (u, v, 1)^T = (x/z, y/z, 1)^T$ and $X' = (u', v', 1)^T = (x'/z', y'/z', 1)^T$,
where (u, v) and (u', v') are the image coordinates of the point in the first and second images respectively. Therefore, the spatial
vector and image vector are related by $\mathbf{x} = zX$, $\mathbf{x}' = z'X'$.

The  basic  rigid  motion  equation  that  relates  spatial  points  at  the  two  time  instances  is: 

$\mathbf{x}' = R\mathbf{x} + T$,  (11)

where R and T are a rotation matrix and translation vector respectively. It is assumed that the camera undergoes
rotation around an axis going through the origin followed by a translation. It is further assumed that the world
coordinate system is centered at the optical center. Note that in monocular sequences, the translation vector T
and the depths of the object points z and z' can only be determined up to a scale factor. Therefore translation is
described in terms of a unit vector and depth estimates are similarly normalized.

The plane where the points are located in 3D space can be represented as

$N^T \mathbf{x} = 1$,  (12)

where N is the plane’s normal vector. The distance d between the origin and the plane is $d = \|N\|^{-1}$. Note that
$d \neq 0$, thus excluding cases in which the plane goes through the origin. Furthermore, since we can only determine
depth up to a scale factor, we can only determine the normal up to a scale factor.

From  Eqs.  11  and  12  we  get 

$\mathbf{x}' = (R + T N^T)\,\mathbf{x}$.  (13)

We define the homography:

$F = R + T N^T$,  (14)

which can be rewritten in terms of image vectors:

$z' X' = F z X$.  (15)

Applying a cross product with X' on both sides of the equation yields:

$X' \times F X = 0$.  (16)

This can be rewritten in terms of the product of a matrix with a vector that contains the elements of the homography
$h = (f_{11}, f_{12}, f_{13}, f_{21}, \ldots, f_{33})^T$:

$\begin{bmatrix} X^T & 0^T & -u' X^T \\ 0^T & X^T & -v' X^T \\ v' X^T & -u' X^T & 0^T \end{bmatrix} h = 0$.  (17)

The third row is a linear combination of the other two and thus can be omitted.

If we stack these 2 rows n times in a matrix, where n is the number of points, we get a 2n x 9 matrix

$A = \begin{bmatrix} X_1^T & 0^T & -u'_1 X_1^T \\ 0^T & X_1^T & -v'_1 X_1^T \\ \vdots & \vdots & \vdots \\ X_n^T & 0^T & -u'_n X_n^T \\ 0^T & X_n^T & -v'_n X_n^T \end{bmatrix}$.  (18)

We then solve for the unit vector $h = \arg\min_h \|Ah\|$, subject to $\|h\| = 1$. If rank(A) = 8, h can be solved up to a scale
factor. Weng et al. show that rank(A) = 8 if and only if there exists a set of four object points such that no image
projections of any three points in this set are collinear in either of the two images. Then, assuming rank(A) = 8, the
solution of h is a unit eigenvector of $A^T A$ associated with the smallest eigenvalue.
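
The linear solve for h can be sketched as follows, assuming each point contributes the two rows (X^T, 0, -u'X^T) and (0, X^T, -v'X^T) of Eq. 17. The rotation, translation, plane normal, and point grid are synthetic values invented for the check, and the function name is hypothetical; the example only verifies that the homography is recovered up to scale and sign on noise-free data.

```python
import numpy as np

def homography_from_points(X, Xp):
    """Estimate the 3x3 homography F (row-major h) from image vectors X -> X'.

    X and Xp are (n, 3) arrays of image vectors (u, v, 1). Two rows per point
    are stacked into A (Eq. 18); h is the unit eigenvector of A^T A with the
    smallest eigenvalue.
    """
    zero = np.zeros(3)
    rows = []
    for Xi, Xpi in zip(X, Xp):
        u, v = Xpi[0], Xpi[1]
        rows.append(np.concatenate([Xi, zero, -u * Xi]))
        rows.append(np.concatenate([zero, Xi, -v * Xi]))
    A = np.array(rows)
    eigvals, eigvecs = np.linalg.eigh(A.T @ A)   # eigenvalues in ascending order
    return eigvecs[:, 0].reshape(3, 3)

# Synthetic planar patch (all numbers below are made up for the demonstration).
theta = 0.1
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
T = np.array([0.1, 0.0, 0.05])
N = np.array([0.05, -0.02, 0.2])          # plane N^T x = 1
F_true = R + np.outer(T, N)               # Eq. 14

u, v = np.meshgrid(np.linspace(-0.5, 0.5, 5), np.linspace(-0.5, 0.5, 5))
X = np.column_stack([u.ravel(), v.ravel(), np.ones(u.size)])
z = 1.0 / (X @ N)                          # depth so that N^T (z X) = 1
x3d = X * z[:, None]                       # x = z X
x3d_new = x3d @ R.T + T                    # x' = R x + T (Eq. 11)
Xp = x3d_new / x3d_new[:, 2:3]             # X' = x' / z'

F_est = homography_from_points(X, Xp)
F_est *= np.sign(np.sum(F_est * F_true))   # resolve the sign ambiguity
F_unit = F_true / np.linalg.norm(F_true)   # compare at unit Frobenius norm
print(np.abs(F_est - F_unit).max())
```

With noisy correspondences the smallest eigenvalue is no longer zero, and the same eigenvector gives the least-squares estimate under the unit-norm constraint.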

Since all the necessary information for F is contained in h, we are now ready to solve for the rotation, translation,
and plane normal from F. There are four cases to consider, corresponding to the multiplicity of the eigenvalues of $F^T F$.
For brevity, these details are omitted. For the four cases and their geometric interpretation, see Weng et al.


5.  COMBINING  SURFACE  ESTIMATES 

The  strategy  is  to  scatter  many  relatively  small  planar/rigid  image  patch  trackers  throughout  the  image.  Using  the 
procedure  described  above,  a  separate  3D  position  and  orientation  estimate  is  recovered  for  each  image  patch.  It 
is possible that structure estimates will be noisy. A regularized solution can be obtained by combining the piece-wise
shape estimates over local neighborhoods via an extension of Szeliski and Tonnesen’s oriented particle systems
formulation. Using this approach, complex surfaces are modeled as sets of local surface elements that interact
with  each  other.  Interaction  potentials  are  devised  that  cause  particles  to  move  into  locally  smooth  arrangements 
subject  to  external  forces  that  are  derived  from  the  image-based  piece-wise  structure  estimates. 

Unlike the particle systems commonly used in computer graphics, our oriented particle system is massless. Instead,
the formulation utilizes potentials that enforce priors on surface bending. This difference in formulation is due to the
particular goal of our application: regularization of the piece-wise planar/rigid structure estimates. Following the
oriented particle systems formulation, we define a co-normality potential $\phi^N_{ij}$ and co-planarity potential $\phi^P_{ij}$
between particles i and j:

$\phi^N_{ij} = 1 - n_i \cdot n_j$,  (19)

$\phi^P_{ij} = (n_i \cdot r_{ij})^2 + (n_j \cdot r_{ij})^2$,  (20)

where $n_i$ and $n_j$ are the unit normals for two piece-wise planar patches, and $r_{ij}$ is the vector connecting the two
patch centers. These two terms determine the surface’s resistance to bending.
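
A small sketch of the two potentials follows, assuming they take the forms 1 - n_i . n_j (co-normality) and (n_i . r_ij)^2 + (n_j . r_ij)^2 (co-planarity). The function names and the sample patches are hypothetical.

```python
import numpy as np

def co_normality(n_i, n_j):
    """Eq. 19: zero when the two unit normals agree."""
    return 1.0 - float(np.dot(n_i, n_j))

def co_planarity(n_i, n_j, p_i, p_j):
    """Eq. 20: zero when each patch center lies in the other's tangent plane."""
    r_ij = p_j - p_i
    return float(np.dot(n_i, r_ij) ** 2 + np.dot(n_j, r_ij) ** 2)

up = np.array([0.0, 0.0, 1.0])
side = np.array([1.0, 0.0, 0.0])
p0 = np.array([0.0, 0.0, 5.0])
p1 = np.array([1.0, 0.0, 5.0])      # same plane z = 5 as p0
p2 = np.array([1.0, 0.0, 5.5])      # offset along the normal

print(co_normality(up, up), co_planarity(up, up, p0, p1))
print(co_normality(up, side), co_planarity(up, up, p0, p2))
```

Both potentials vanish for neighboring patches that lie on a common plane with a common normal, and grow as the patches bend or step away from each other.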

In  the  simulation,  the  potentials  are  combined  in  an  internal  energy  term  that  sums  the  inter-particle  energies: 

$E_{internal} = \sum_{i,j} (\alpha\, \phi^N_{ij} + \phi^P_{ij})\, \beta(r_{ij})$,  (21)

where $\alpha$ is a scale factor that controls the relative importance of the terms, and $\beta$ is a monotonically decreasing
function used to limit the range of the forces and torques derived from the potential energy function. For this
application, the function we use is $\beta(r_{ij}) = \max(1 - (\|r_{ij}\|/d)^m, 0)$, where d is the desired falloff distance, and m
controls the rate of falloff.

Due to this falloff, a particle is affected by forces and torques exerted by the other particles only within its local
neighborhood $\mathcal{N}_i$. Equations for the forces and torques can be found in the oriented particle systems literature.
For numerical conditioning, a damping term is added to both force and torque equations.

To gain a regularized estimate of the piece-wise surface, we run a particle simulation. We define two sets of
particles in the simulation: surface particles and data particles. One surface particle and one data particle are
defined for each piece-wise planar surface estimated in the image. The initial value of each surface and data particle
is the position and orientation estimated via tracking as described in Sec. 4. Data particles remain fixed during the
simulation, while surface particles are free to move. Each pair of data and surface particles can be joined by a linear
spring.

The particle system’s behavior is described by an ordinary differential equation, and integrated in time via
Euler’s method until the change in the potential energy between iterations falls below a threshold. The regularized
piece-wise surface is taken as the position/orientation of the surface particles at the end of the simulation.
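
The simulation loop can be illustrated with a heavily simplified sketch. It is not the force and torque model of the oriented particle systems formulation: here the normals are held fixed (no torques), only a co-planarity energy and the data springs are used, the falloff weights are assumed to have the form max(1 - (dist/d)^m, 0) and are computed once from the data positions, and the massless particles follow first-order dynamics dp/dt = force. All parameter values are hypothetical.

```python
import numpy as np

def falloff(dist, d=1.5, m=2):
    """Range-limiting weight, assumed here to be max(1 - (dist/d)^m, 0)."""
    return np.maximum(1.0 - (dist / d) ** m, 0.0)

def regularize(data_pos, normals, k=1.0, dt=0.01, tol=1e-10, max_iter=20000):
    """Relax surface particles under co-planarity forces plus springs to data particles."""
    diff = data_pos[None, :, :] - data_pos[:, None, :]
    w = falloff(np.linalg.norm(diff, axis=2))     # fixed neighbor weights
    np.fill_diagonal(w, 0.0)
    pos = data_pos.copy()                          # surface particles start at the data

    def energy(p):
        r = p[None, :, :] - p[:, None, :]                  # r[i, j] = p_j - p_i
        ni_r = np.einsum('ik,ijk->ij', normals, r)         # n_i . r_ij
        nj_r = np.einsum('jk,ijk->ij', normals, r)         # n_j . r_ij
        pair = 0.5 * np.sum(w * (ni_r ** 2 + nj_r ** 2))   # each pair counted once
        spring = 0.5 * k * np.sum((p - data_pos) ** 2)
        return pair + spring, ni_r, nj_r

    e_prev, ni_r, nj_r = energy(pos)
    for _ in range(max_iter):
        # Force on particle i: negative gradient of the energy above.
        pull = 2.0 * (np.einsum('ij,ik->ik', w * ni_r, normals)
                      + np.einsum('ij,jk->ik', w * nj_r, normals))
        pos = pos + dt * (pull - k * (pos - data_pos))     # Euler step
        e_now, ni_r, nj_r = energy(pos)
        if abs(e_prev - e_now) < tol:                      # energy change below threshold
            break
        e_prev = e_now
    return pos

# 3 x 3 grid of patches near the plane z = 5, with alternating depth errors.
gx, gy = np.meshgrid(np.arange(3.0), np.arange(3.0))
noise = np.where((gx + gy) % 2 == 0, 0.1, -0.1)
data_pos = np.column_stack([gx.ravel(), gy.ravel(), 5.0 + noise.ravel()])
normals = np.tile([0.0, 0.0, 1.0], (9, 1))
surface_pos = regularize(data_pos, normals)
print(np.std(data_pos[:, 2]), np.std(surface_pos[:, 2]))
```

The spring constant k trades fidelity to the per-patch estimates against smoothness: a larger k keeps surface particles closer to their data particles.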

It  is  possible  that  there  are  depth  discontinuities  present  in  the  scene,  and  therefore  particles  may  lie  on  different 
sides  of  a  depth  discontinuity.  The  forces  that  bind  particles  should  therefore  be  modeled  as  springs  that  break  apart 
if particles are too far out of alignment.

The advantage of using the oriented particle system approach is that it requires no a priori knowledge of the
piece-wise surface’s topology. One disadvantage is that the approach requires careful parameter setting. Furthermore, the
computational complexity of simulation is prohibitive for large particle systems; each update of the system requires
the calculation of O(n^2) inter-particle forces. The complexity issue can be addressed through the use of spatial data
structures.


6.  “GOOD”  IMAGE  PATCHES  TO  TRACK 

Piece-wise  structure  recovery  depends  on  the  registration  of  deforming  image  patches  from  frame  to  frame.  In  our 
proposed  system,  the  strategy  is  to  track  many  patches  at  a  time.  Some  patches  will  be  relatively  “good”  and  will 
allow  accurate  tracking  of  deformation.  Other  patches  may  present  problems  in  deformable  region  tracking,  and 
should  be  detected. 

For  instance,  some  image  patches  may  have  relatively  low  contrast  and  therefore  will  be  unfit  for  tracking.  More 
generally,  we  need  to  anticipate  and  deal  with  the  aperture  problem  in  estimating  patch  motion.  At  each  pixel,  it 
is  only  possible  to  estimate  that  component  of  image  velocity  that  is  orthogonal  to  an  image  isobrightness  contour. 
One  solution  to  this  problem  is  to  calculate  motion  over  larger  image  patches.  Since  we  are  tracking  relatively  large 
image  patches  (on  the  order  of  16  x  16  or  32  x  32  pixels),  it  is  often  possible  to  resolve  the  aperture  problem,  assuming 
sufficient  image  contrast. 

However,  in  general,  there  will  still  be  some  image  patches  for  which  it  is  impossible  to  reliably  estimate  the 
motion  parameters  due  to  the  aperture  problem.  In  certain  cases,  parameter  estimates  may  be  ambiguous  or  under¬ 
constrained.  This  is  a  generalization  of  the  aperture  problem.®^  It  effects  not  only  estimates  of  translational  motion, 
but  estimates  of  deformational  motion  as  well.  It  may  be  possible  to  reliably  estimate  only  a  subset  of  deformation 
parameters  given  an  image  patch  of  a  particular  texture.  This  ambiguity  can  be  detected  by  computing  the  rank 
of the matrix B^T B employed in image registration (Sec. 3.2). If this matrix is rank deficient, then there will be an
inherent  ambiguity  in  tracking  for  that  patch. 
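
The rank test can be sketched for the simplest case, a translation-only motion model, in which each pixel contributes one row [Ix, Iy] of image gradients to B and the normal matrix (written B^T B here) is 2 x 2. This is a simplification made for illustration: the deformation models used for registration have more parameters, and a singular value decomposition would take the place of the closed-form eigenvalues. The function names and the thresholds `min_eig` and `max_cond` are hypothetical.

```python
import math


def normal_matrix_eigenvalues(patch):
    """Eigenvalues (small, large) of the 2x2 matrix B^T B.

    `patch` is a list of rows of grey levels. Gradients are central
    differences over interior pixels; translation-only model assumed.
    """
    sxx = sxy = syy = 0.0
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            ix = 0.5 * (patch[y][x + 1] - patch[y][x - 1])
            iy = 0.5 * (patch[y + 1][x] - patch[y - 1][x])
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]
    mean = 0.5 * (sxx + syy)
    spread = math.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    return mean - spread, mean + spread


def is_trackable(patch, min_eig=1.0, max_cond=100.0):
    """True if B^T B is well conditioned (thresholds are illustrative)."""
    small, large = normal_matrix_eigenvalues(patch)
    return small > min_eig and large <= max_cond * small
```

A patch containing a single straight edge yields a zero smallest eigenvalue (the classical aperture problem), while a patch with gradients in two directions passes the test.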

More generally, the inverse of B^T B serves, up to a noise-variance scale factor, as the estimated covariance matrix
of the recovered registration parameters for each patch. These covariances could be incorporated directly into the
structure recovery and the oriented particle simulation. This would allow resolution of possible ambiguities by
pooling over neighborhoods, and is left as future work.

Unfortunately,  even  a  “good”  image  patch  can  be  difficult  to  track  if  it  lies  on  a  depth  discontinuity,  across  the 
boundary  of  a  specular  highlight,  or  if  it  is  occluded  during  tracking.  The  use  of  the  influence  function  formulation 
in  registration  provides  improved  robustness  to  these  effects.  The  particular  robust  error  norm  employed  reaches  its 
theoretical breakdown point when the fraction of outliers exceeds 50%. As suggested by,®^ patches that straddle
depth  discontinuities  can  be  detected  by  inspecting  the  residual  error  in  registration  at  each  step. 
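
One possible way to inspect the residuals is sketched below. It is an assumption made for illustration, not the test used in our system: a robust scale is estimated from the median absolute deviation (MAD) of the per-pixel residuals, and a patch is flagged when too large a fraction of its pixels lie far from the median. The names `outlier_fraction` and `straddles_discontinuity` and the thresholds `k` and `max_fraction` are hypothetical.

```python
import statistics


def outlier_fraction(residuals, k=2.5):
    """Fraction of residuals farther than k robust sigmas from the median."""
    med = statistics.median(residuals)
    mad = statistics.median(abs(r - med) for r in residuals)
    sigma = 1.4826 * mad  # rescales the MAD to a Gaussian standard deviation
    if sigma == 0.0:
        # degenerate case: more than half of the residuals are identical
        return sum(1 for r in residuals if r != med) / len(residuals)
    return sum(1 for r in residuals if abs(r - med) > k * sigma) / len(residuals)


def straddles_discontinuity(residuals, max_fraction=0.25):
    """Flag a patch whose outlier fraction exceeds an illustrative threshold."""
    return outlier_fraction(residuals) > max_fraction
```

Like the robust norm itself, the median-based scale estimate stops being meaningful once more than half of the pixels in the patch are outliers.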

7.  PRELIMINARY  EXPERIMENTS

To test the capabilities of our proposed framework, we built an experimental implementation of the piece-wise planar
tracking system. Our system was implemented on an SGI O2 with a ISOMhz R5K processor and 128MB RAM. At
this time, only the tracking and piece-wise structure modules have been fully tested. The particle system module
has  undergone  preliminary  testing  with  planar  motion  sequences.  Full  integration/evaluation  of  the  particle  system 
module  is  expected  for  the  final  version  of  this  paper. 

The  basic  piece-wise  structure  approach  was  tested  on  synthetic  sequences  in  which  ground  truth  was  known. 
The  experimental  setup  for  generating  synthetic  sequences  was  as  follows.  A  polygonal,  texture  mapped  model  was 
rendered  under  perspective  projection  using  OpenGL  at  128  x  128  resolution.  The  resulting  image  sequence  was  then 
used  as  a  test  sequence.  For  visualization  purposes,  the  recovered  normal  and  patch  location  were  then  displayed 
overlaid  on  the  input  frames.  Additional  orthographic  views  were  displayed  for  ease  of  viewing. 

The  system  was  tested  on  approximately  twenty  synthetic  sequences  under  varying  amounts  of  rotation,  scaling, 
translation,  and  deformation.  Two  different  3D  deformation  functions  were  used:  quadratic  bending,  and  helical 
twisting. Illumination was kept fixed, since previous experiments with active blobs^*^ already demonstrated
robustness of the tracker to illumination changes. Each image region tracked was 32 x 32 pixels in size.
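
To make the ground-truth setup concrete, the sketch below generates points and unit normals on a deformed planar sheet and projects them under perspective. The exact deformation functions used in the experiments are not reproduced here; the forms assumed are a depth offset proportional to u^2 for bending and a rotation about the u axis by an angle proportional to u for twisting. The parameters `k`, `tau`, `z0`, and `focal` are hypothetical.

```python
import math


def _unit(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def bend(u, v, k=0.5, z0=5.0):
    """Assumed quadratic bending: P(u, v) = (u, v, z0 + k u^2)."""
    point = (u, v, z0 + k * u * u)
    # normal = dP/du x dP/dv = (1, 0, 2ku) x (0, 1, 0)
    return point, _unit((-2.0 * k * u, 0.0, 1.0))


def twist(u, v, tau=0.5, z0=5.0):
    """Assumed helical twist: rotate the sheet about the u axis by tau * u."""
    a = tau * u
    point = (u, v * math.cos(a), z0 + v * math.sin(a))
    # normal = dP/du x dP/dv, worked out analytically
    return point, _unit((-tau * v, -math.sin(a), math.cos(a)))


def project(point, focal=1.0):
    """Perspective projection onto the image plane at distance `focal`."""
    x, y, z = point
    return focal * x / z, focal * y / z
```

Sampling `bend` or `twist` over a grid of (u, v) values gives the 3D patch centers and normals against which recovered estimates can be compared, and `project` gives their image positions.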

Results  for  two  different  synthetic  sequences  are  shown  in  Figs.  3  and  4.  In  both  figures,  the  first  frame  in  the 
input  sequence  is  shown  in  (a),  with  the  initial  position  of  image  patches  shown  overlaid  in  white.  Subsequent  frames 
in  the  sequence  are  shown  in  (b).  Ground  truth  normals  are  shown  in  green.  Estimated  normals  are  shown  in  red.  To 
better visualize the result, orthographic views of the surface and normals are shown below each image in the sequence
(c,d).

Since  the  polygonal  model  and  the  deformation  were  known,  ground  truth  structure  and  normal  information  was 
readily  available.  This  allowed  us  to  compute  error  in  orientation  estimates.  Throughout  the  synthetic  sequences 
tested,  the  dot  product  between  the  estimated  and  ground  truth  normals  had  an  average  value  of  0.97  (15°). 
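
The orientation error measure can be sketched as follows: the dot product of the normalized estimated and ground truth normals gives the cosine of the angle between them, and the arccosine converts it to degrees. The function names are hypothetical. Note that a dot product of exactly 0.97 corresponds to about 14 degrees, so the figures quoted above agree to within rounding.

```python
import math


def angular_error_deg(n_est, n_true):
    """Angle in degrees between an estimated and a ground truth normal."""
    dot = sum(a * b for a, b in zip(n_est, n_true))
    norm = math.sqrt(sum(a * a for a in n_est)) * math.sqrt(sum(b * b for b in n_true))
    # clamp to [-1, 1] to guard acos against floating-point round-off
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def mean_normal_agreement(estimates, truths):
    """Return (mean cosine, mean angle in degrees) over paired normals."""
    angles = [angular_error_deg(e, t) for e, t in zip(estimates, truths)]
    cosines = [math.cos(math.radians(a)) for a in angles]
    return sum(cosines) / len(cosines), sum(angles) / len(angles)
```

The mean of the cosines and the cosine of the mean angle are not the same quantity in general, so it is worth stating which of the two is being reported.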




Figure  3.  Example  tracking  with  synthetic  sequence:  twisting.  A  perspective  image  sequence  was  generated  for  a  deforming
plane.  The  first  frame  in  the  input  sequence  is  shown  in  (a),  with  the  initial  position  of  image  patches  shown  overlaid  in 
white.  Frames  taken  from  later  in  the  input  sequence  are  shown  in  (b).  Ground  truth  normals  are  shown  in  green.  Estimated 
normals  are  shown  in  red.  To  better  visualize  the  result,  corresponding  orthographic  side  views  are  shown  below  each  image 
in  the  sequence  (c,d). 


Figure  4.  Second  example  of  tracking  with  synthetic  sequence:  quadratic  bending  of  planar  sheet.  A  perspective  image 
sequence  was  generated  and  piece-wise  model  estimates  were  obtained  as  in  previous  example.  The  first  frame  in  the  input
sequence  is  shown  in  (a),  with  the  initial  position  of  image  patches  shown  overlaid  in  white.  Subsequent  frames  are  shown  in 
(b).  As  before,  ground  truth  normals  are  shown  in  green  and  estimated  normals  are  shown  in  red.  Corresponding  orthographic 
top  views  are  shown  below  each  image (c,d). 


The  system  has  also  been  tested  on  real  image  sequences  of  deformable  objects  in  motion.  Frames  taken  from 
a tracking sequence of a deforming foam rubber block are shown in Fig. 5. As before, tracked regions are
shown  outlined  in  white  and  estimated  normals  are  shown  in  red  (displayed  under  perspective  projection).  As  can 
be  seen,  the  results  look  reasonable  despite  the  large  deformation  and  non-rigidity.  We  expect  that  the  results  will 
improve  further  with  inclusion  of  the  particle  system  module. 


Figure  5.  Example  tracking  with  a  real  image  sequence:  a  foam  rubber  block  deforming.  As  before,  tracked  regions  are
shown  outlined  in  white  and  estimated  normals  are  shown  in  red  (displayed  under  perspective  projection). 